Problem Statement:

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones. ReCell, a startup aiming to tap the potential in this market, wants to analyze the data provided and build a linear regression model to predict the price of a used phone and identify the factors that significantly influence it.

Importing necessary libraries and data

Load the dataset

Data Overview

Convert the datatype of several columns from object to category
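A minimal sketch of this conversion, assuming the data lives in a pandas DataFrame. The column names and values below are hypothetical stand-ins, not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset's columns and values may differ.
df = pd.DataFrame({
    "brand_name": ["Honor", "Samsung", "Honor"],
    "os": ["Android", "Android", "Others"],
    "4g": ["yes", "yes", "no"],
    "used_price": [86.96, 161.49, 60.50],
})

# Columns that hold a small set of repeated labels (assumed list).
cat_cols = ["brand_name", "os", "4g"]

# Cast each label column to the category dtype.
for col in cat_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

The category dtype stores each distinct label once, which usually reduces memory use and makes the intent of the column explicit.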

Exploratory Data Analysis (EDA)

Questions:

  1. What does the distribution of used phone prices look like?
  2. What percentage of the used phone market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?
  4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
  6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the used phone price?

Create functions to create plots

EDA

Univariate analysis

Let's explore the dependent variable used_price

Let's explore the independent variable new_price

Let's explore the independent variable days_used

Let's explore the independent variable weight

Let's explore the independent variable 5G

Bivariate Analysis

Let's look at correlations.

Observations

Let's look at the graphs of a few variables that are highly correlated with used_price.

used price vs new price vs 5G status

used price vs new price vs 4G status

used price vs ram capacity

used price vs selfie cam mp

check the used price with release year

Answer to key questions

1. Explore the distribution of the dependent variable- used phone price

Observations

2. Percentage of Android phone

3. How does the amount of RAM vary with the brand?

4. How does the weight vary for phones offering large batteries (more than 4500 mAh)?

5. How many phones are available across different brands with a screen size larger than 6 inches?

6 inches = 15.24 cm
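As a rough illustration of how the count could be obtained, the sketch below converts the 6-inch threshold to centimetres and counts phones per brand above it. It assumes the screen size is recorded in centimetres in a column called `screen_size` and the brand in `brand_name`; both names and all values are hypothetical.

```python
import pandas as pd

INCH_TO_CM = 2.54
threshold_cm = 6 * INCH_TO_CM  # 6 inches expressed in cm

# Hypothetical toy data; screen_size is assumed to be in cm.
df = pd.DataFrame({
    "brand_name": ["Honor", "Honor", "Samsung", "Apple", "Samsung"],
    "screen_size": [16.23, 14.50, 15.90, 15.10, 17.78],
})

# Keep only phones whose screen is larger than the threshold,
# then count how many remain for each brand.
large_screen = df[df["screen_size"] > threshold_cm]
counts = large_screen["brand_name"].value_counts()

print(counts)
```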

6. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?

7. Which attributes are highly correlated with the used phone price?

Data Preprocessing

Missing value treatment

Let's fix the missing values.

Now we don't have any missing values.

Feature Engineering

Outlier detection and treatment

Outlier Treatment
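The outline does not say which treatment was applied. One common option, shown here purely as an assumption, is to clip values that fall outside the IQR whiskers. The helper name `clip_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def clip_outliers_iqr(series: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Clip values lying beyond whisker * IQR from the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    # Values below `lower` become `lower`; values above `upper` become `upper`.
    return series.clip(lower=lower, upper=upper)

# Hypothetical phone weights in grams, with one extreme value.
weights = pd.Series([140.0, 150.0, 155.0, 160.0, 165.0, 170.0, 855.0])
clipped = clip_outliers_iqr(weights)

print(clipped.tolist())
```

Clipping keeps every row in the data, which matters when the extreme values are genuine devices (for example tablets) rather than recording errors.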

EDA

Univariate analysis

Bivariate Analysis

Look at the Correlation

Observation

using pairplot to visualise

used price vs new price vs 5G status

used price vs new price vs 4G status

brandname vs used price

check used price vs 5g vs release year

check used price vs 4g vs release year

check used price vs year

selfie camera vs used price

ram vs used price

days used vs used price

Building a Linear Regression model

We want to predict the used phone price.

Before we proceed to build a model, we'll have to encode categorical features.

We'll split the data into train and test to be able to evaluate the model that we build on the train data.

We will build a Linear Regression model using the train data and then check its performance.

Split the dataset
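A minimal sketch of the encoding and splitting steps, assuming pandas and scikit-learn. The toy columns, the 70/30 split ratio, and the `random_state` value are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "brand_name": ["Honor", "Samsung", "Apple", "Honor", "Samsung",
                   "Apple", "Honor", "Samsung", "Apple", "Honor"],
    "os": ["Android", "Android", "iOS", "Android", "Android",
           "iOS", "Android", "Android", "iOS", "Android"],
    "new_price": [110.0, 250.0, 700.0, 130.0, 300.0,
                  650.0, 120.0, 280.0, 720.0, 140.0],
    "days_used": [127, 325, 162, 345, 293, 223, 234, 219, 161, 327],
    "used_price": [87.0, 161.0, 469.0, 80.0, 180.0,
                   420.0, 75.0, 175.0, 480.0, 85.0],
})

# Separate the predictors from the target.
X = df.drop(columns="used_price")
y = df["used_price"]

# One-hot encode the categorical predictors; drop_first avoids a redundant
# dummy column for each categorical variable.
X = pd.get_dummies(X, drop_first=True)

# Hold out 30% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape, X_test.shape)
```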

Model Building

Let's check the coefficients and intercept of the model.

Let's make predictions on the test set (X_test) with the model, and compare the actual output values with the predicted values.

Model performance evaluation

We will check the performance of the model using different metrics.

We will use the following metrics: RMSE, MAE, and R².

We will define a function to calculate MAPE and adjusted R².

The mean absolute percentage error (MAPE) measures prediction accuracy as a percentage. It is the average, over all observations, of the absolute difference between the actual and predicted values divided by the actual value. It works best when the data contain no extreme values and none of the actual values is 0. We will create a function that prints all of the above metrics in one go.
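The metrics can be sketched with NumPy alone, as below. The function names (`mape_score`, `adj_r2_score`, `model_performance`) and the sample numbers are hypothetical; the notebook's own helpers may look different.

```python
import numpy as np

def mape_score(targets, predictions):
    """Mean absolute percentage error, in percent."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.mean(np.abs(targets - predictions) / np.abs(targets)) * 100

def r2_score(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((targets - predictions) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adj_r2_score(n_predictors, targets, predictions):
    """Adjusted R-squared for n observations and k predictors."""
    n = len(targets)
    k = n_predictors
    return 1 - (1 - r2_score(targets, predictions)) * (n - 1) / (n - k - 1)

def model_performance(n_predictors, targets, predictions):
    """Collect RMSE, MAE, R-squared, adjusted R-squared and MAPE in one dict."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    errors = targets - predictions
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        "R-squared": float(r2_score(targets, predictions)),
        "Adj. R-squared": float(adj_r2_score(n_predictors, targets, predictions)),
        "MAPE": float(mape_score(targets, predictions)),
    }

# Hypothetical actual and predicted prices from a one-predictor model.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]
metrics = model_performance(1, actual, predicted)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R² penalizes each extra predictor through the (n - 1) / (n - k - 1) factor, so it only rises when a new predictor improves the fit by more than chance would.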

Observations

Model building - statsmodels

Observations

Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

  1. No Multicollinearity

  2. Linearity of variables

  3. Independence of error terms

  4. Normality of error terms

  5. No Heteroscedasticity

TEST FOR MULTICOLLINEARITY

General Rule of thumb:

- If VIF is between 1 and 5, then there is low multicollinearity.
- If VIF is between 5 and 10, we say there is moderate multicollinearity.
- If VIF exceeds 10, there is high multicollinearity.

Removing Multicollinearity

To remove multicollinearity:

  1. Drop each column that has a VIF score greater than 5, one at a time, and refit the model.
  2. Look at the adjusted R-squared and RMSE of each of these models.
  3. Drop the variable whose removal changes the adjusted R-squared the least.
  4. Check the VIF scores again.
  5. Continue until all VIF scores are under 5.

Observations

Now that no feature has a p-value greater than 0.05, we will consider the features in x_train3 as the final ones and olsmod2 as the final model.

Observation

Now we'll check the rest of the assumptions on olsmod2.

  1. Linearity of variables

  2. Independence of error terms

  3. Normality of error terms

  4. No Heteroscedasticity

TEST FOR LINEARITY AND INDEPENDENCE

How to check linearity and independence?

TEST FOR NORMALITY

Why the test?

How to check normality?
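Two common checks are a Q-Q plot of the residuals and the Shapiro-Wilk test. The sketch below applies both with SciPy to a synthetic residual vector; in practice the residuals would come from the fitted model. It is an illustration under those assumptions, not necessarily the exact check used in the notebook.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the model residuals (hypothetical data).
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1.0, size=300)

# Shapiro-Wilk test. Null hypothesis: the residuals are normally distributed.
statistic, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic = {statistic:.4f}, p-value = {p_value:.4f}")

# Q-Q plot data without drawing: theoretical vs. ordered sample quantiles,
# plus the least-squares line through them. An r close to 1 means the
# points lie close to a straight line.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(f"Q-Q correlation r = {r:.4f}")
```

A p-value above 0.05 means the test does not reject normality. With large samples the test flags even tiny departures, so the Q-Q plot is a useful complement.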

How to fix if this assumption is not followed?

TEST FOR HOMOSCEDASTICITY

Why the test?

How to check for homoscedasticity?

How to fix if this assumption is not followed?

Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.

Now that we have checked all the assumptions of linear regression and they are satisfied, we can move towards the prediction part.

Note: As the number of records is large, we show a sample of only 25 records for presentation purposes.

Let's compare the initial model created with sklearn and the final statsmodels model.

Final Model Summary

Let's recreate the final statsmodels model and print its summary to gain insights.

Actionable Insights and Recommendations

Recommendation